Skip to content

Add opentelemetry to Armada#4973

Open
nikola-jokic wants to merge 1 commit into
masterfrom
nikola-jokic/otel
Open

Add opentelemetry to Armada#4973
nikola-jokic wants to merge 1 commit into
masterfrom
nikola-jokic/otel

Conversation

@nikola-jokic

Copy link
Copy Markdown
Contributor

What type of PR is this?

Enhancement

What this PR does / why we need it

Improve observability of armada services and their interactions

@greptile-apps

greptile-apps Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds OpenTelemetry distributed tracing to all Armada services, wiring up OTLP exporters, gRPC/HTTP instrumentation, a span attribute policy processor, and a local dev stack with Jaeger + otel-collector.

  • OTel lifecycle: observability.InitOTel / ShutdownOTel are called in each service's main.go or startup function; the tracer provider is fail-open (unreachable collector falls back to noop) and protected by a mutex.
  • Instrumentation: otelgrpc stats handlers are added server-side in grpc.go and client-side in connection.go; REST is wrapped with otelhttp in gateway.go; Logger.WithContext injects trace_id/span_id into log lines.
  • Attribute policy: a new SpanAttributePolicyProcessor enforces a deny-list (PII/sensitive keys), allow-list (rpc.*, http.*, armada.*, etc.), and a 1000-value cardinality cap on non-exempt attributes.

Confidence Score: 3/5

Two concrete defects need fixing before merge: duplicate otelgrpc handlers on one code path produce double spans, and the cardinality guard will silently corrupt job-level trace attributes in any production-scale deployment.

The duplicate otelgrpc.NewClientHandler() registration in server.go + connection.go means every RPC from the server's internal API connection generates two client spans and doubles gRPC metrics. The cardinality tracker caps all armada.* attributes at 1000 unique values; since job IDs are unique per job, traces will show [HIGH_CARDINALITY] for job-identifying attributes after the first thousand jobs, making the tracing essentially useless for its primary use case in production.

internal/server/server.go (duplicate stats handler), internal/common/observability/attribute_policy.go (cardinality exemption for armada.* attributes)

Important Files Changed

Filename Overview
internal/common/observability/lifecycle.go New file: initializes/shuts down global OTel tracer provider with mutex-protected state, fail-open design, and protocol-aware exporter selection. Previously flagged issues addressed in this version.
internal/common/observability/attribute_policy.go New file: span attribute sanitization processor. Two issues: armada.* attributes not cardinality-exempt (job IDs replaced after 1000 unique values), and bare 'key' in deniedContains over-matches legitimate names.
internal/common/observability/config.go New file: ObservabilityConfig struct and validation — correct and well-tested.
internal/server/server.go Adds otelgrpc client handler, but connection.go already adds one unconditionally — duplicate stats handlers cause double spans.
pkg/client/connection.go Adds otelgrpc.NewClientHandler() as a base dial option; safe on its own but causes duplication when callers also pass their own WithStatsHandler.
internal/common/grpc/grpc.go Adds otelgrpc server-side stats handler and fixes prometheus registerer usage — both correct.
internal/common/grpc/gateway.go Wraps REST gateway with otelhttp; refactors stripPrefix branching. Correct.
internal/common/logging/logger.go Adds WithContext injecting trace_id/span_id into log entries. Correct and guarded.
internal/testsuite/trace_evidence.go New file: in-memory OTel trace collector for integration tests.
_local/compose/full.yaml Adds Jaeger and otel-collector to local dev stack. Correct.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Svc as Armada Service
    participant OTel as InitOTel()
    participant TP as TracerProvider
    participant OTLP as OTLP Exporter
    participant Coll as OTel Collector
    participant Jaeger as Jaeger UI

    Svc->>OTel: InitOTel(cfg)
    OTel->>OTLP: newTraceExporter(ctx, cfg)
    OTLP-->>OTel: exporter (http or grpc)
    OTel->>TP: NewTracerProvider(AttributePolicyProcessor, BatchSpanProcessor, Sampler)
    OTel-->>Svc: otel.SetTracerProvider(tp)

    Svc->>TP: tracer.Start(ctx, operation)
    TP->>TP: SpanAttributePolicyProcessor.OnStart
    TP-->>Svc: span

    Svc->>TP: span.End()
    TP->>TP: SpanAttributePolicyProcessor.OnEnd
    TP->>OTLP: BatchSpanProcessor export
    OTLP->>Coll: OTLP HTTP/gRPC
    Coll->>Jaeger: OTLP gRPC

    Svc->>OTel: ShutdownOTel(ctx)
    OTel->>TP: tp.Shutdown(ctx)
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Svc as Armada Service
    participant OTel as InitOTel()
    participant TP as TracerProvider
    participant OTLP as OTLP Exporter
    participant Coll as OTel Collector
    participant Jaeger as Jaeger UI

    Svc->>OTel: InitOTel(cfg)
    OTel->>OTLP: newTraceExporter(ctx, cfg)
    OTLP-->>OTel: exporter (http or grpc)
    OTel->>TP: NewTracerProvider(AttributePolicyProcessor, BatchSpanProcessor, Sampler)
    OTel-->>Svc: otel.SetTracerProvider(tp)

    Svc->>TP: tracer.Start(ctx, operation)
    TP->>TP: SpanAttributePolicyProcessor.OnStart
    TP-->>Svc: span

    Svc->>TP: span.End()
    TP->>TP: SpanAttributePolicyProcessor.OnEnd
    TP->>OTLP: BatchSpanProcessor export
    OTLP->>Coll: OTLP HTTP/gRPC
    Coll->>Jaeger: OTLP gRPC

    Svc->>OTel: ShutdownOTel(ctx)
    OTel->>TP: tp.Shutdown(ctx)
Loading

Reviews (5): Last reviewed commit: "Implement otel and integrate with all se..." | Re-trigger Greptile

Comment thread internal/common/observability/lifecycle.go Outdated
Comment thread internal/common/observability/attribute_policy.go
Comment thread internal/testsuite/trace_evidence.go
Comment thread internal/common/observability/lifecycle.go Outdated
@datadog-armadaproject

datadog-armadaproject Bot commented Jun 19, 2026

Copy link
Copy Markdown

Pipelines

⚠️ Warnings

🚦 2 Pipeline jobs failed

CI | All jobs succeeded   View in Datadog   GitHub Actions

CI | test / Golang Integration Tests   View in Datadog   GitHub Actions

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 53aefc4 | Docs | Give us feedback!

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Comment on lines +82 to +83
deniedContains: []string{"password", "secret", "token", "api_key", "apikey", "key"},
cardinalityExemptPrefixes: []string{"rpc.", "http.", "net.", "server.", "service."},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 armada.* attributes not cardinality-exempt — job IDs will be silently replaced in production

armada. is in allowedPrefixes but absent from cardinalityExemptPrefixes. Any high-cardinality armada.* attribute — e.g. armada.job_id or armada.run_id, which are unique per submitted job — will hit the 1000-value cap after the first thousand distinct jobs and be replaced with [HIGH_CARDINALITY] for every subsequent job. In a production Armada cluster that processes more than 1000 jobs between restarts, all job-level trace attributes would become unreadable, directly undermining the observability goal of this PR. armada. should be added to cardinalityExemptPrefixes (or the limit for that prefix set much higher), and truly high-cardinality keys like armada.job_id should be listed explicitly in cardinalityExemptExact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant